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Abstract 

In this paper, we examine the problem of approximating a general linear dimensionality reduction (LDR) operator, 
represented as a matrix A £ R mX71 w j t jj m ^ by a partial circulant matrix with rows related by circular shifts. 
Partial circulant matrices admit fast implementations via Fourier transform methods and subsampling operations; our 
investigation here is motivated by a desire to leverage these potential computational improvements in large-scale data 
processing tasks. We establish a fundamental result, that most large LDR matrices (whose row spaces are uniformly 
distributed) in fact cannot be well-approximated by partial circulant matrices. Then, we propose a natural generalization 
of the partial circulant approximation framework that entails approximating the range space of a given LDR operator 
A £ R mXn over a restricted domain of inputs, using a matrix formed as a product of a partial circulant matrix having 
m! > m rows and a mx m' “post-processing” matrix. We introduce a novel algorithmic technique, based on sparse 
matrix factorization, for identifying the factors comprising such approximations, and provide preliminary evidence to 
demonstrate the potential of this approach. 


Index Terms 

Subspace learning, circulant matrices, matrix factorization, sparse regularization, big data 

I. Introduction 

Numerous tasks in signal processing, statistics, and machine learning employ dimensionality reduction methods 
to facilitate the processing, visualization, and analysis of (ostensibly) high-dimensional data in (more tractable) low¬ 
dimensional spaces. Among the myriad of dimensionality reduction techniques in the literature, linear dimensionality 
reduction (LDR) methods remain among the most popular and widely-used. 

One well-known example is principal component analysis |TJ characterized by a linear dimensionality reduction 
(LDR) operator designed to maximally preserve the variance of the original data in the projected space, and which 
has been widely used for data compression, denoising, and as a dimensionality reducing pre-processing step for other 
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analyses (e.g., clustering, classification, etc.) |2|. Other classical data analysis methods that employ specialized LDR 
operators include linear discriminant analysis (where the operator is designed to preserve separations among original 
data points belonging to disparate classes), and canonical correlations analysis (where the operator maximizes 
correlations among projected data points). See the survey paper [!3J for additional examples. 

In recent years, LDR methods have also been utilized for universal “precompression” in high-dimensional 
inference tasks. For example, fully random LDR operators are at the essence of the initial investigations into 
compressed sensing (CS) (see, e.g., 0-0); and other, more stmctured, LDR operators - both non-adaptive (see, 
e.g-, @-GZ)) and adaptive (see, e.g., [18]-|32j) - have also been examined recently in the context of CS and 
sparse inference. 

The computational efficiency of LDR methods is often cited as one of their primary virtues; however, as data 
of interest become increasingly large-scale (high-dimensional, and numerous), even the relatively low computa¬ 
tional complexity associated with LDR methods can become significant. In this paper, we investigate the utility 
of employing partial circulant approximations to general LDR operators; partial circulant matrices admit fast 
implementations (via convolution or Fourier transform methods, and subsampling), and their use as surrogates to 
arbitrary LDR matrices may provide potentially significant computational efficiency improvements in practice. 

A. Problem Statement and Our Contributions 

We represent an arbitrary (real) linear dimensionality reduction (LDR) operator as a matrix A £ K mxn , with 
m < n. Operating with A on an arbitrary vector x £ M" generally requires 0(mn) operations, which can be 
superlinear in n For even modest values m (e.g., when m = nP for /3 £ (0,1] the complexity is 0(n 1+ ^)). Here, 
we seek computationally-efficient approximations of A implementable via convolution and downsampling (and, 
perhaps, modest post-processing). 

We first consider approximating A as A « SC, where C is an n x n circulant matrix and S is a matrix 
with m rows that are a (permuted) subset of I n (n x n identity). These partial circulant approximations enjoy 
an implementation complexity of 0(n log?r), owing to fast Fourier transform methods. Despite these potential 
implementation complexity benefits, our first contribution establishes that most LDR matrices, in fact, cannot be 
well-approximated by partial circulant matrices. 

We then propose a generalization that uses approximations of the form A « PSC, where C and S are as above, 
except that S has some m' > m rows, and P is an m x m' “post-processing” matrix. Operating with such matrices 
requires 0(mm' + nlogn) operations in general, and can be as low as O(nlogn), e.g., when m! = ©(n 1 / 2 ). 
Within this framework, we also exploit the fact that signals of interest often reside on a restricted input domain 
(e.g., they lie on a union of subspaces, manifold, etc.), so we may restrict our approximation to mimicking the 
action of A on these inputs. We describe a data-driven approach to learning the factors of the approximating matrix 
in these settings, and provide empirical evidence to demonstrate the viability of this approximation approach. 
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B. Connections to Existing Work 

Circulant approximations to square matrices are classical in linear algebra; for example, circulant preconditioners 
for linear systems were examined in |[33]-|35j. In a line of work motivated by “optical information processing,” 
several efforts have examined fundamental aspects of approximating square matrices by products of circulant and 
diagonal matrices j36)-g§. Here, our focus is on LDR matrices (not square matrices), so results from these works 
are not directly applicable here. 

Random partial circulant matrices have been studied recently in compressed sensing tasks EH -J45|, and random 
partial circulant matrices with diagonal pre-processors have been proposed as computationally efficient methods 
for Johnson-Lindenstrauss (JL) embeddings in The work J48) established the viability of using random 

partial circulant matrices for embedding manifold-structured data. In contrast to these works, here our aim is to 
approximate the action of given LDR matrix, not necessarily to perform JL embeddings. 


C. Outline 

Following the introduction of a few preliminaries (below), we establish in Sect. [II] a fundamental approximation 
result regarding partial circulant approximations to general LDR matrices. In Sect. [Ill] we propose a generalized 
approach to the partial circulant approximation problem, and provide empirical evidence to demonstrate its efficacy. 
A brief discussion is provided in Sect. |IV| proofs appear in the Appendix. 


D. Preliminaries 

We introduce some preliminary concepts and notation that will be used throughout the sequel. First, we define 


R = 


(n-l)xl 

1 


0 


lx(n-l) 


to be the “right rotation” matrix, in that post-multiplication of a row vector by R implements a circular shift to the 
right by one position. Analogously, post-multiplying a row vector by R T implements a circular shift to the left by 
one position. To simplify notation we write L = R r ; note that LR = I n . 

We represent an n x n (real) circulant matrix by 


C = 


Cl 

c 2 • 

■ Cn 


C T 

Cn 

Cl • 

Cn— 1 

= 

c t R 

_c 2 

c 3 • 

Cl 


c r R’ 1 - 1 


( 1 ) 


where c = [ci 


'. We let C n denote the set of all (real) n-dimensional circulant matrices of the form 


Q. For m < n, the set of all m x n real partial circulant matrices is 

VC m , n = {SC € M mxn | S g S m , C G C n j , 

where S m is the set of to x n row sampling matrices whose rows comprise to different canonical basis vectors of 
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For a matrix A £ R mxrl , we denote its m individual rows by A, . for i = 1, 2,..., m and its n columns by 
A-j £ for j = 1 We write the squared Frobenius norm of A as ||A||^ = JA ■ |Ajj| 2 , and its 1,2 

norm by ||A||i ; 2 = Y^j=i ||-A. :>J -1|2» where 11A : 11 2 denotes the Euclidean norm of A.j. 


II. A Fundamental (Negative) Approximation Result for Partial Circulant Matrices 
As above, we let A £ R m x 71 denote the LDR matrix we aim to approximate using a partial circulant matrix 
of the same dimensions. Here, we consider a tractable (and somewhat natural) choice of the approximation error 
metric, and seek to find the matrix Z £ PC m „ closest to A in the Frobinius sense. In this setting, the minimum 
approximation error is given by 

£pc m „( A) = min ||A - Z|||. (2) 

’ zet>c„,,„ 

Finding the minimizer of <(2]) is a non-convex optimization problem, owing to the product nature of elements of 
VC m , n . Nevertheless, we obtain a precise characterization of the minimum achievable approximation error for a 
given A via the following lemma (whose proof appears in the appendix). 

Lemma 2.1: For A £ R mxra , we have 


where 


£j>c.,.(A) = ||A|| 2 p -R 2 (A), 

R(A) = m «“^' A ‘ 2 L '‘ll 2 

fGJ 7 


(3) 


(4) 


and T = jf = [/i-../ m ] £ {0, l} m f, ^ f 0 Vi ^ j}. 

We call the term R{ A) in (|4]i the Rubik’s Score of the matrix A, inspired by the fact that I HA) is maximized 
when the circular shifts of the rows {A,, } are “maximally aligned” (indeed, the numerator of Q is a sum of 
rotated rows of A). 

Evidently, 0 < R( A) || A|| p, with — |.A.11 p foi all .A. £ 72 , but beyond these boundary cases. 

Lemma [2~T| yields limited interpretational insight into the achievable error for any particular A. We gain additional 
insight here using a probabilistic technique - instead of quantifying the approximation error for a fixed A, we 
consider instead matrices A whose row spaces are distributed uniformly at random on Gr(m, n), the Grassmannian 
manifold of m-dimensional linear subspaces of Si" . Using this model, we can quantify the proportion of matrices 
so drawn whose optimal partial circulant approximation error is at most a fixed fraction (say S) of their squared 
Frobenius norm. 

To this end, we exploit the fact that matrices whose row-spaces are uniformly distributed on Gr(m, n) may be 
modeled as matrices having iid zero-mean Gaussian elements. With this, we establish the following theorem (proved 
in the appendix). 

Theorem 2.1: For 2 < m < n, let A £ R mxn have iid A/”(0, 1) entries. Then for <5 £ [0,0.125), and n is 
sufficiently large, there exists a positive constant c(S) such that 

Pr(£ PCm JA) < S\\A\\ 2 f ) = 0(e~ c ^- mn ). 
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Algorithm 1 “Data-Driven” Partial Circulant Approximation 
Inputs: LDR matrix A £ E r " x ", parameters A. /i, e > 0, 

Matrix of “representative” data X £ R nxp , 

Initialize: M ( °) = US (from the SVD AX = USV T ) 

obj (0) = ||AX||| 

repeat 

CW = argmin ce c„ ||AX - M^CXIII + p\\C\\ 2 F 
MW = argmin MeRm xn || AX - MC«X|||, + A||M|| 2j1 
obj (t) = ||AX - M^C^XIII + /i||cW||J. + A||M«|| 2 ,i 
until obj (t) - obj (t_1) < e • obj (t_1) 

Output: M* = = C (t) 


Simply put, the content of Theorem 2.1 is that the proportion of large matrices (with uniformly distributed row 


spaces) that can be approximated to high accuracy by partial circulant matrices is exponentially small in the product 
of the matrix dimensions. In the next section we propose a more general framework designed to facilitate accurate 
partial circulant approximations in a number of practical applications. 


III. A Generalized “Data-Driven” Partial Circulant Approximation Approach 

be pessimistic is, perhaps, intuitive, since A £ R mx " has 0(mn ) “degrees 


That the conclusion of Theorem 


2.1 


of freedom,” while any C £ VCm tn has only 0(n + m). These structural limitations are, of course, fundamental. 
We gain additional leverage here by exploiting a more general approximation model, and exploiting additional latent 
“signal space characteristics” that arise in many practical applications. 

Formally, we consider approximations of A of the form A ss PSC where C £ C n , S is an rn' x n matrix 
comprising a permuted subset of rows of identity, and P £ W nxm is a “post-processing” matrix. This extension 
allows approximations of A whose rows are generally linear combinations of the rows of a partial circulant matrix 
rather than being rows of a partial circulant matrix themselves, facilitating accurate approximation of any matrix 
whose row space is spanned by vectors related by circular shifts. 

Further, we exploit the fact that in many applications where LDR methods are deployed (e.g, PCA, LDA, 
Compressive Sensing, etc.), the high-dimensional data to be processed are not arbitrary, but instead lie in some 
restricted input domain (e.g., they lie in a low-dimensional subspace, a union of low-dimensional subspaces, distinct 
clusters, etc.). In these cases, our approximation task reduces to a simpler task of mimicking the action of A on 
these restricted inputs. 

Suppose X £ R nxp is a matrix whose columns are “representative” of the restricted input domain for the problem 
of interest. Then, minimizing an objective of the form ||AX — PSCX|||, essentially ensures the approximation be 
accurate in an (empirical) average sense. We propose an algorithmic approach based on alternating minimization and 
regularized matrix factorization for identifying the factors comprising this more general approximation; the method 
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Fig. 1. Data-driven approximation of matrices comprised of top principal components of a training set of ‘2’ digits from the USPS dataset. 
The panels are (from left to right) average approximation error vs. m !, histogram of normalized training errors, and histogram of normalized 
test data errors. The top and bottom rows correspond to matrices A whose rows are the top 5 and top 10 principal component direction vectors, 
respectively. 


is summarized as Algorithm [T] Note that in our approach, we effectively combine the action of the sampling and 
post processing matrices; we let PS = M, and seek column sparsity in M using the ||M|| 2 ,i regularization term, 
which penalizes the number of nonzero columns of M (the resulting number m' of nonzero columns depends on 
the regularization parameter A > 0). We also include a (simple) regularization term for C £ R" x ” to fix scaling 
ambiguities; leaving C unconstrained would allow elements of M to become arbitrarily small, and would circumvent 
the effect of the column-sparsity regularization. Both of these sub-problems are convex, and can be solved using 
off-the-shelf software (e.g., SLEP |49|). 

To evaluate the performance of this approach, we chose X to be a matrix of training data whose columns 
are 500 vectorized 16 x 16 images of the digit ‘2’ from the USPS handwritten digit dataset (available at http: 
//www.cs.nyu.edu/~roweis/data.html), and held out the remaining images of the digit ‘2’ to form a “test” data set. 
We select the rows of the matrix A to be a subset of the top principal component vectors of the training set X, and 
we consider two settings corresponding to A having 5 rows and 10 rows. Then, we applied the procedure of Alg. [I] 
with // = 0.1 and varying values of A to obtain different factorizations M* and C*. For each, we quantified the 
normalized error on the training set (on a logarithmic scale) vs. the column sparsity of M*. For a convenient choice 
of A, we also computed histograms of the normalized approximation errors (|j Ax — MCx|||/||xj||) for each point 
in the testing and training data sets. The results are provided in Fig. [Tj and illustrate that accurate approximations 
can indeed be obtained for reasonable values of m' for each case, and that the approximations generalize well to 
the test data, in that the approximation error histogram is concentrated around small values for each (R 2% error 
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for most points). 


IV. Discussion and Conclusions 

In this work we established a fundamental result regarding partial circulant approximations of general matrices, 
and proposed and demonstrated a general approach for approximating the action of general linear dimensionality 
reduction (LDR) operators over restricted input domains. The convolutional nature of the approximations we consider 
here, in addition to their computational expediency, makes them viable for implementation using LTI systems, or 
for specialized applications (e.g., RADAR) where convolutional models arise naturally. 

A more general investigation motivated by computational complexity considerations would also include other 
specialized matrix structures with low implementation complexities (e.g., sparse matrices or “fast” or sparse Johnson- 
Lindenstrauss embeddings (50), (5TJ). A thorough, unifying analysis of such “low-complexity” approximations is 
the topic of our ongoing work, and will be reported in a future publication. 


Appendix 


A. Proof of Lemma \2.1\ 

We first write £pc,„ „ (A) in (2]i equivalently 


£vo (A) = min min 

’ SGS m CeC„ 


j A — SC||fi. 


(5) 


Now, a key insight is that each choice of S € S rn corresponds to a vector f from the set T defined in Lemma 2.1 


as including the i-th row of I„ in S corresponds to selecting the i-th row of C, which is given from <[T} by c t R* 1 . 
Using this we may parameterize the choice of S in terms of f, and rewrite the objective function in (5]) as 

m 

IIA — SC|||. = ^2 ||Aj, ; — c T R/ i ll^ 

i= 1 
m 

= ^2 ||Aj 7 ;H 2 + c t R^L^c — 2Aj >: L^c. 

i =1 

Thus, since R^ = I„, we have 

£t>c (A) = min min || A|| % + mc T c — 2 ( A, | c. (6) 

m ’ feJ^ceR" \'f-—^ " / 

We first minimize the objective with respect to c, keeping f fixed to any arbitrary value in T. This is an 
unconstrained strictly convex quadratic problem whose minima can be obtained by equating the gradient (with 
respect to c) to zero. We identify the optimal c to be c* = ( A,, : ) r . Substituting c* into (6), and 

simplifying, we obtain 

£ vc m , n { A) = min||A|||, - — 
feJ 7 m 

which gives (3]), using the definition (4]i of the Rubik’s Score. 


X>-A- 


i =1 
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B. Proof of Theorem \2.1\ 

Note that IJ 7 ! = n\/(n — m)\ (since it is just the number of ways of choosing m elements out of n without 
replacement), and we have (trivially) that n\/{n — m)\ < n m , so 


Pr(£pc m ,„(A) < (5||A||^) 


= Pr (1 - <5)||A||^ < sup 

V ft? 


E™iA 



(i-<y)||A|||.< 




m 


Pr (l-«y)||A||f.< 




where the last step follows from union bounding. 

Now, we can simplify further by introducing the shorthand notation a = [A, . A 2 . ... A m .] T £ R mn and 
Rf = [R-^ 1 R^ 2 . ,.R^ m ] £ SQ ||A||^ = a T a and || E™ 1 Ai,:L-^ i |||/m = a T (RjTRf/m)a. It 

is easy to check that (R f r Rf /m)R f r = RjT, which implies that the columns of R^ are eigenvectors of the 
matrix corresponding to eigenvalue 1. Further, since (RjTR f/m) is symmetric, it admits an eigendecomposition 
(Rf'Rf/m) = UfEfUjT with Uf orthonormal and Ef diagonal, and since it is rank n, Ef has exactly n entries 
being 1 (and the rest 0). 

Incorporating this insight into the probabilistic analysis above, we obtain that 


P rCd-^IAII^lS^FNl 

\ m 

= Pr (a T U f ((1 — i5)I mn — S f )) U^a < 0) 

= Pr(a r ((l-5)I mn -E f ) a < 0) , 


where the components of a = Ufa are iid Af(0, 1) due to the unitary invariance of the Gaussian distribution. Thus, 
with a slight overloading of notation, we may write that 


Pr(S VCm JA) < S\\A\\ 2 f ) 

( m m \ 

Ell A vlli<^Ell A vll2 • (7) 

i =2 t =1 ) 

At this point, we note that vectorizing (row-wise) a random matrix A £ K" 1 x " having iid zero-mean Gaussian 
elements yields an mn-dimensional vector whose direction is selected uniformly at random from Gr( 1, ran). Thus, 
E”Li 11 A.,-. || 2 quantifies the length of the vector, while E "=2 ||Aj : ||| describes the energy retained after projecting 
the vector onto the fixed n(m — 1)-dimensional subspace spanned by the last n(m — 1) coordinates. It follows that 
the probability on the right-hand side of ([7]) may be interpreted in terms of the energy retained after projecting 
a fixed mn -dimensional unit-normed vector onto a subspace selected uniformly at random from Gr (n,mn). To 
quantify this, we use the following. 
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Lemma A.l (Adapted from Thm. 2.14 of [52^): Let v be a fixed unit-length vector in W a randomly oriented 
£;-dimensional subspace, and w the projection of v onto W. For 0 < e < 1, 

Pr (j|w|| 2 < (1 - £)s/kjd s j < 3e" fe2/64 . 

Using this result (with k = n(m — 1) and d = mri) we obtain that for d < 1 — 1/m, 

Pr(£pcr m . n (A) < tf||A||!) < n m Pr 


< 

< 

Now, it is straightforward to verify that for S < 1/8 = 
2v2i5), there exists a positive constant 




3 n m e 


3 e 


m log n— 1] | 1-2 


0.125 and n sufficiently large, so that n/logn > 128/(1 — 


for which 


In this case, it follows that 


=< { > < is ( ! 


, 1 - 2AS) - ^ 


m log n — 


n(m — 1) 
64 


1 - 2 , 


1- i 


< —c(5)mn. 


Pr(£-p Cm „ (A) < <5||Aj||.) < 3e~< s > mn . 
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